Chapter 13: Exercises¶

In [1]:
library(tidyverse)
library(bayesrules)
library(bayesplot)
library(rstan)
library(rstanarm)
library(broom.mixed)
library(tidybayes)
library(ggmosaic)

Exercise 13.6¶

In [2]:
head( hotel_bookings )
A tibble: 6 × 32 (middle columns elided with ⋯; column names abbreviated for readability)
hotel        is_canceled lead_time year month     week day wkend_nts week_nts adults ⋯ deposit_type agent company wait_days customer  avg_rate parking requests status    status_date
City Hotel   1             1       2015 September  40   30  0         2        1     ⋯ Non Refund    50   NULL    0         Transient   98.10   0       0       Canceled  2015-09-29
Resort Hotel 1            19       2016 March      12   19  2         4        2     ⋯ No Deposit   240   NULL    0         Transient   70.17   0       1       Canceled  2016-03-02
Resort Hotel 0             9       2017 August     31    1  0         4        2     ⋯ No Deposit   241   NULL    0         Transient  193.40   0       1       Check-Out 2017-08-05
Resort Hotel 0           110       2016 November   46   11  0         1        2     ⋯ No Deposit   314   NULL    0         Transient   36.24   1       0       Check-Out 2016-11-12
City Hotel   0           329       2017 July       30   27  0         2        2     ⋯ No Deposit     9   NULL    0         Transient   89.10   0       1       Check-Out 2017-07-29
Resort Hotel 0           212       2017 August     35   31  2         8        2     ⋯ No Deposit   143   NULL    0         Transient   89.75   0       0       Check-Out 2017-09-10

a)¶

In [3]:
mean( hotel_bookings$is_canceled == '1' )
0.366

37% of bookings are canceled. This is a lot!

b)¶

Compute means per class:

In [4]:
hotel_bookings %>% 
    select( is_canceled, lead_time, previous_cancellations, is_repeated_guest, average_daily_rate ) %>% 
    group_by( is_canceled ) %>% 
    summarise_all( list(mean=mean, sd=sd) )
A tibble: 2 × 9
is_canceled lead_time_mean previous_cancellations_mean is_repeated_guest_mean average_daily_rate_mean lead_time_sd previous_cancellations_sd is_repeated_guest_sd average_daily_rate_sd
0                 83.45899                 0.007886435             0.03943218                97.86945     93.90104                 0.1636886            0.1947745              46.75579
1                142.49180                 0.267759563             0.01912568               106.99902    116.21010                 1.9029606            0.1371542              43.31366

Comparing the differences in means between the two groups to the standard deviations, we should not expect strongly separated density plots.

Lead time¶

Linear scale:

In [5]:
options(repr.plot.width=15, repr.plot.height=5)
plot <- ggplot( hotel_bookings, aes(x=lead_time, fill=is_canceled)) + geom_density( alpha=0.5 ) 
plot

Log scale:

In [6]:
plot + scale_x_log10()
Warning message:
“Transformation introduced infinite values in continuous x-axis”
Warning message:
“Removed 44 rows containing non-finite values (`stat_density()`).”

Boxplot:

In [7]:
ggplot( hotel_bookings, aes(x=lead_time, y=is_canceled) ) + geom_boxplot()

It appears that the further in advance a room is booked, the higher the probability of cancellation. This makes sense: if one books only 1-2 weeks ahead, one is typically sure that everything will work out.

Previous cancellations¶

Check distribution of previous cancellations:

In [8]:
hotel_bookings %>% 
    group_by( previous_cancellations ) %>% 
    summarise( n=n() )
A tibble: 5 × 2
previous_cancellations   n
                     0 949
                     1  48
                     4   1
                    25   1
                    26   1

Most of the guests did not cancel previously. Using the raw number of previous cancellations might lead to severe overfitting. As an alternative, one could include a binary variable 'has_canceled_previously'.
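Such a recoding is a one-liner; a minimal sketch (assuming the bayesrules data is loaded; hotel_bookings_bin is my name):

```r
library(dplyr)
library(bayesrules)
data(hotel_bookings)

# Collapse the sparse count into a binary indicator
hotel_bookings_bin <- hotel_bookings %>%
    mutate(has_canceled_previously = previous_cancellations > 0)

# Only 51 of the 1000 bookings have a previous cancellation
hotel_bookings_bin %>% count(has_canceled_previously)
```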

Density plot:

In [9]:
ggplot( hotel_bookings, aes(x=previous_cancellations, fill=is_canceled)) + 
    geom_density( alpha=0.5 ) + 
    scale_x_log10()
Warning message:
“Transformation introduced infinite values in continuous x-axis”
Warning message:
“Removed 949 rows containing non-finite values (`stat_density()`).”

This is a risky visualisation; it mainly makes clear how sparse the data are.

Mosaic plot:

In [10]:
ggplot( data=hotel_bookings %>% mutate(has_canceled_previously=previous_cancellations>0) ) + 
    geom_mosaic( aes(x=product(is_canceled, has_canceled_previously), fill=is_canceled) )
Warning message:
“`unite_()` was deprecated in tidyr 1.2.0.
ℹ Please use `unite()` instead.
ℹ The deprecated feature was likely used in the ggmosaic package.
  Please report the issue at <https://github.com/haleyjeppson/ggmosaic>.”

It appears that a previous cancellation is an indicator for a cancellation now.

Table:

In [11]:
hotel_bookings %>% 
    group_by( is_canceled, previous_cancellations>0 ) %>% 
    summarise( n=n() )
`summarise()` has grouped output by 'is_canceled'. You can override using the
`.groups` argument.
A grouped_df: 4 × 3
is_canceled previous_cancellations > 0   n
0           FALSE                      632
0           TRUE                         2
1           FALSE                      317
1           TRUE                        49

Repeated guests¶

Mosaic plot:

In [12]:
ggplot( data=hotel_bookings ) + geom_mosaic( aes(x=product(is_canceled, is_repeated_guest), fill=is_canceled) )

Table:

In [13]:
hotel_bookings %>% 
    group_by( is_canceled, is_repeated_guest ) %>% 
    summarise( n=n() )
`summarise()` has grouped output by 'is_canceled'. You can override using the
`.groups` argument.
A grouped_df: 4 × 3
is_canceled is_repeated_guest   n
0                           0 609
0                           1  25
1                           0 359
1                           1   7

Of the few repeated guests, slightly fewer appear to cancel their reservation; however, the numbers are too low for a clear answer. There is also a risk of overfitting to the small counts here.

Average daily rate¶

In [14]:
ggplot( hotel_bookings, aes(x=average_daily_rate, fill=is_canceled)) + geom_density( alpha=0.5 )

The differences appear to be small and probably not significant; at least, no clear pattern emerges.

c)¶

Prior on intercept: Reasonable range for $\pi$ between 0.1 and 0.8, centered at 0.4. In terms of log-odds:

In [15]:
log(0.4/(1-0.4))
-0.405465108108164

A probability of 0.1 corresponds to a log-odds of $\approx -2$ and 0.8 to a log-odds of $\approx 1.4$.

In [16]:
log(0.1/(1-0.1))
-2.19722457733622
In [17]:
log(0.8/(1-0.8))
1.38629436111989

This can be more or less reproduced with a Gaussian prior centered around -0.4 with a standard deviation of 0.8:

In [18]:
plot_normal( mean=-0.4, sd=0.8 )
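As a quick base-R sanity check: the central 95% of a Normal(-0.4, 0.8) prior spans log-odds of roughly -2 to 1.2, i.e. probabilities of roughly 0.12 to 0.76 — close to the intended range:

```r
# Central 95% interval of the Normal(-0.4, 0.8) prior on the log-odds scale
log_odds <- qnorm(c(0.025, 0.975), mean = -0.4, sd = 0.8)
log_odds          # ≈ -1.97  1.17
# Back-transformed to probabilities with the inverse logit
plogis(log_odds)  # ≈ 0.12  0.76
```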

Resulting model:

$$Y_i|\beta_0,\beta_1,\beta_2,\beta_3,\beta_4 \sim \text{Bern}(\pi_i), \quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i}$$

$$\beta_0 \sim N(0.4, 0.8)$$

$$\beta_1 \sim N(0,\cdot), \quad \beta_2 \sim N(0,\cdot), \quad \beta_3 \sim N(0,\cdot), \quad \beta_4 \sim N(0,\cdot)$$

where $\cdot$ stands for autoscale=TRUE (weakly informative prior).

d)¶

Since the outcome is binary categorical, it is best modeled with a Bernoulli distribution. The parameter $\pi$ of that distribution must be a probability in the interval [0,1], which is best achieved with a sigmoid function, or equivalently by modeling the log-odds with a linear ansatz. Since $\beta_1,\ldots,\beta_4$ are again parameters of a linear model, we can assume Gaussian priors as usual.
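In base R the log-odds transformation and its sigmoid inverse are available as qlogis() and plogis(); a short round trip illustrates the mapping used here:

```r
pi <- 0.4
qlogis(pi)              # log-odds: log(0.4/0.6) ≈ -0.405
plogis(qlogis(pi))      # back to 0.4
# The sigmoid squashes any real-valued linear predictor into (0, 1)
plogis(c(-10, 0, 10))   # ≈ 0.0000454  0.5  0.9999546
```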

Exercise 13.7¶

a)¶

Simulate the model:

In [19]:
booking_model_1 <- stan_glm( is_canceled ~ lead_time + previous_cancellations + is_repeated_guest + average_daily_rate,
                             data = hotel_bookings, family = binomial,
                             prior_intercept = normal(0.4, 0.8),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.554266 seconds (Warm-up)
Chain 1:                0.611307 seconds (Sampling)
Chain 1:                1.16557 seconds (Total)
Chain 1: 


Prior summary:

In [20]:
prior_summary( booking_model_1 )
Priors for model 'booking_model_1' 
------
Intercept (after predictors centered)
 ~ normal(location = 0.4, scale = 0.8)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [1,1,1,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [0.0094,0.8588,5.6790,...])
------
See help('prior_summary.stanreg') for more details

Diagnostics:

In [21]:
mcmc_trace( booking_model_1 )
In [22]:
mcmc_dens_overlay( booking_model_1 )
In [23]:
mcmc_acf( booking_model_1 )
Warning message:
“The `facets` argument of `facet_grid()` is deprecated as of ggplot2 2.2.0.
ℹ Please use the `rows` argument instead.
ℹ The deprecated feature was likely used in the bayesplot package.
  Please report the issue at <https://github.com/stan-dev/bayesplot/issues/>.”

Looks good!

Posterior predictive check:

In [24]:
proportion_of_cancellations <- function(x){mean(x == 1)}
pp_check(booking_model_1, nreps = 100,
         plotfun = "stat", stat = "proportion_of_cancellations") + xlab("probability of cancellations")
Warning message:
“'nreps' is ignored for this PPC”
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Appears to be in an ok range.

b)¶

In [25]:
tidy(booking_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
A tibble: 5 × 5
term                        estimate   std.error     conf.low    conf.high
(Intercept)             -1.909140726 0.206106095 -2.178027271 -1.645967782
lead_time                0.004824958 0.000689820  0.003950848  0.005728844
previous_cancellations   2.216640092 0.387012270  1.733796500  2.716642477
is_repeated_guest       -0.745651735 0.544346703 -1.469581416 -0.080961138
average_daily_rate       0.007238742 0.001571894  0.005247197  0.009247278
$$\log\left(\frac{\pi}{1-\pi}\right) = -1.9 + 0.0048 X_1 + 2.2 X_2 - 0.75 X_3 + 0.007 X_4$$

$$\frac{\pi}{1-\pi} = \exp\left(-1.9 + 0.0048 X_1 + 2.2 X_2 - 0.75 X_3 + 0.007 X_4\right)$$

$$\pi = \frac{1}{1+e^{-z}}, \quad z = -1.9 + 0.0048 X_1 + 2.2 X_2 - 0.75 X_3 + 0.007 X_4$$
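As a concrete example (hypothetical booking; the coefficients are the posterior medians from the tidy() output above): a new guest booking 30 days ahead, with no previous cancellations, at an average daily rate of 100, gets a posterior-median cancellation probability of about 0.26:

```r
# Posterior medians from the tidy() output above; hypothetical predictor values
z <- -1.909 + 0.00482 * 30 + 2.217 * 0 + (-0.746) * 0 + 0.00724 * 100
plogis(z)  # sigmoid of the linear predictor, ≈ 0.26
```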

c)¶

$\beta_2$:

In [26]:
c( exp(1.73), exp(2.22), exp(2.72) )
  1. 5.64065390842832
  2. 9.20733086588225
  3. 15.1803222449539

With every additional previous cancellation, the odds of a cancellation increase by a factor between 5.6 and 15.2! (As stated before, I consider this risky; I would prefer a binary field 'has_canceled_previously'.)

$\beta_3$:

In [27]:
c( exp(-1.47), exp(-0.75), exp(-0.08))
  1. 0.229925485186724
  2. 0.472366552741015
  3. 0.923116346386636

If the guest has stayed at the hotel before, the odds of cancellation appear to lie anywhere between being cut to one quarter and staying almost the same, which is quite a large uncertainty (here, too, I see a risk of overfitting to the small counts).

d)¶

95% credible interval:

In [28]:
tidy(booking_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 5 × 5
term                        estimate   std.error     conf.low    conf.high
(Intercept)             -1.909140726 0.206106095 -2.325342491 -1.508698267
lead_time                0.004824958 0.000689820  0.003494805  0.006213864
previous_cancellations   2.216640092 0.387012270  1.493876016  3.010409575
is_repeated_guest       -0.745651735 0.544346703 -1.890735848  0.226827159
average_daily_rate       0.007238742 0.001571894  0.004203853  0.010290770

At the 95% level, lead_time, previous_cancellations and average_daily_rate appear to be significantly related to the cancellation probability. The 95% CI for is_repeated_guest encloses zero, so it is not significant at this level. Given the considerations in Exercise 13.6, however, I have my doubts about how well previous_cancellations and average_daily_rate would work in practice.

Exercise 13.8¶

a)¶

In-sample:

In [29]:
classification_summary(model = booking_model_1, data = hotel_bookings, cutoff = 0.5)
$confusion_matrix
A tabyl: 2 × 3
y     0   1
0   574  60
1   250 116

$accuracy_rates
A data.frame: 3 × 1
sensitivity      0.3169399
specificity      0.9053628
overall_accuracy 0.6900000

CV:

In [30]:
cv_accuracy <- classification_summary_cv(model = booking_model_1, data = hotel_bookings, cutoff = 0.5, k = 10)
cv_accuracy$cv
A data.frame: 1 × 3
sensitivity specificity overall_accuracy
  0.3019413   0.9048041            0.684
In [31]:
cv_accuracy$folds$overall_accuracy
  1. 0.7
  2. 0.61
  3. 0.67
  4. 0.66
  5. 0.64
  6. 0.68
  7. 0.66
  8. 0.71
  9. 0.71
  10. 0.8

b)¶

Sensitivity, specificity and overall accuracy are very similar in-sample and under cross-validation. In general this makes sense, because it is hard for a linear model to overfit the data. However, I am surprised that the small counts in previous_cancellations and is_repeated_guest did not have a bigger impact.

c)¶

The model correctly predicts whether a booking is canceled in 69% of the cases. Of all the bookings that will be canceled, the model finds only 32%; of all bookings that will not be canceled, it finds 91%.
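These three rates follow directly from the in-sample confusion matrix in part a) (rows: actual 0/1; columns: predicted 0/1); a quick arithmetic check:

```r
# Confusion matrix from classification_summary() in part a):
#            predicted 0  predicted 1
# actual 0        574          60
# actual 1        250         116
sensitivity <- 116 / (250 + 116)   # canceled bookings correctly flagged
specificity <- 574 / (574 + 60)    # non-cancellations correctly passed
overall     <- (574 + 116) / 1000  # all correct predictions
round(c(sensitivity, specificity, overall), 3)  # 0.317 0.905 0.690
```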

d)¶

In [32]:
find_sens_spec <- function( cutoff ) {
    sm <- classification_summary(model = booking_model_1, data = hotel_bookings, cutoff = cutoff)
    c( cutoff, sm$accuracy_rates[[1]] )
}

df <- data.frame( matrix(nrow=0, ncol=4) )
colnames(df) <- c("cutoff", "sensitivity", "specificity", "overall_accuracy")
for (c in seq(0.1,0.95,0.05)) {
    df[nrow(df)+1,] = find_sens_spec( c )
}
In [33]:
df %>% 
    pivot_longer( sensitivity:overall_accuracy, names_to="metric" ) %>% 
    ggplot( aes(x=cutoff, y=value, color=metric) ) +
    geom_line() +
    geom_point() +
    geom_hline( yintercept=0.75, linetype="dashed", color = "black" )
In [34]:
df %>% filter( abs(sensitivity-0.75) < 0.2 )
A data.frame: 3 × 4
cutoff sensitivity specificity overall_accuracy
  0.25   0.8907104   0.3359621            0.539
  0.30   0.7841530   0.5047319            0.607
  0.35   0.6256831   0.6545741            0.644

For a sensitivity close to 75%, i.e. capturing about 75% of all cancellations, the cutoff needs to be set to about 0.3. This results in a specificity of 50%, i.e. half of all non-cancellations are predicted as cancellations (false positives).

Exercise 13.9¶

a)¶

Simulate:

In [35]:
booking_model_1_df <- data.frame(booking_model_1)
preds <- booking_model_1_df %>% 
    mutate( mu=X.Intercept. + lead_time*30 + is_repeated_guest*0 + previous_cancellations*1 + average_daily_rate*100 ) %>% 
    mutate( odds=exp(mu), prob=odds/(odds+1) ) %>% 
    mutate( Y=rbinom(nrow(booking_model_1_df), size=1, prob=prob) )
head( preds )
A data.frame: 6 × 9
  X.Intercept.   lead_time previous_cancellations is_repeated_guest average_daily_rate        mu     odds      prob Y
1    -2.067350 0.004819027               2.652966        -0.6439337        0.008666092 1.5967956 4.937186 0.8315700 0
2    -2.004914 0.003755334               2.524631        -0.5101437        0.008897862 1.5221634 4.582128 0.8208568 1
3    -2.078681 0.004511512               2.720257        -1.1924180        0.009052037 1.6821243 5.376966 0.8431856 1
4    -2.211178 0.004633320               2.262709        -0.6388802        0.009974415 1.1879714 3.280420 0.7663781 1
5    -1.619227 0.004681103               1.912430        -0.7522682        0.005492169 0.9828526 2.672068 0.7276739 1
6    -1.690405 0.005637259               2.433125        -0.7060644        0.004518225 1.3636604 3.910481 0.7963540 1

Distribution of probabilities:

In [36]:
ggplot( preds ) + geom_histogram( aes(x=prob) ) + geom_vline( xintercept=0.5, color="red", linetype="dashed" )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Distribution of model predictions:

In [37]:
ggplot( preds, aes(x = Y) ) + geom_bar( aes(y=..prop.., group=1))
Warning message:
“The dot-dot notation (`..prop..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(prop)` instead.”
In [38]:
mean( preds$Y )
0.7589

The model predicts a cancellation 76% of the time. In a bet, I would place my money on the guest cancelling.

b)¶

This can be achieved, for example, by setting the number of previous cancellations higher. Then it hardly matters what the other variables are:

In [39]:
preds2 <- booking_model_1_df %>% 
    mutate( mu=X.Intercept. + lead_time*30 + is_repeated_guest*0 + previous_cancellations*5 + average_daily_rate*250 ) %>% 
    mutate( odds=exp(mu), prob=odds/(odds+1) ) %>% 
    mutate( Y=rbinom(nrow(booking_model_1_df), size=1, prob=prob) )
mean(preds2$Y)
1
In [40]:
ggplot( preds2 ) + geom_histogram( aes(x=prob) ) + geom_vline( xintercept=0.5, color="red", linetype="dashed" )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

I would advocate for not treating previous_cancellations as a numeric variable here.
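A sketch of that alternative fit (not run here; booking_model_2 is my name for it, the indicator is built inline with I(), and the priors mirror Exercise 13.7):

```r
library(rstanarm)
library(bayesrules)
data(hotel_bookings)

# Replace the raw count with a binary 'has canceled before' indicator
booking_model_2 <- stan_glm(
    is_canceled ~ lead_time + I(previous_cancellations > 0) +
        is_repeated_guest + average_daily_rate,
    data = hotel_bookings, family = binomial,
    prior_intercept = normal(0.4, 0.8),
    prior = normal(0, 1, autoscale = TRUE),
    chains = 4, iter = 5000*2, seed = 84735)
```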

Exercise 13.10¶

In [42]:
robot_data <- pulse_of_the_nation %>% select( robots, transformers, books, age, income )
head( robot_data )
A tibble: 6 × 5
robots   transformers books age income
Unlikely            1    20  64      8
Unlikely            0     6  56     68
Unlikely            0     0  63     46
Unlikely            0     1  48     51
Unlikely            1    30  32    100
Unlikely            0    15  64     54

a)¶

In [43]:
robot_data %>% group_by( robots ) %>% summarise( unlikely_prop=n()/nrow(robot_data) )
A tibble: 2 × 2
robots   unlikely_prop
Likely           0.199
Unlikely         0.801

Around 80% believe that it is unlikely.

b)¶

Transformers¶

In [44]:
robot_data %>% 
    group_by( transformers ) %>% 
    summarise( count=n() )
A tibble: 6 × 2
transformers count
           0   459
           1   157
           2   155
           3   111
           4    67
           5    51
In [45]:
ggplot( data=robot_data ) + geom_mosaic( aes(x=product(robots, transformers), fill=robots) )

It appears that watching 2 or more Transformers movies very slightly increases the belief that robots will take over. This might not be significant.

Books¶

In [46]:
ggplot( robot_data ) + geom_histogram( aes(x=books, fill=robots), alpha=0.5 )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

It looks like people who read few books find it more likely that robots will take over the world; however, this might not be significant.

Age¶

In [47]:
ggplot( robot_data ) + geom_density( aes(x=age, fill=robots), alpha=0.5 )

It looks like people above 70 believe it is more likely that robots will take over the world.

Income¶

In [48]:
ggplot( robot_data ) + geom_density( aes(x=income, fill=robots), alpha=0.5 )

It appears that people with low income believe it is more likely that robots will take over the world.

c)¶

Prior on intercept: Reasonable range for $\pi$ between 0.5 and 0.95, centered at 0.8. In terms of log-odds:

In [49]:
log(0.8/(1-0.8))
1.38629436111989

A probability of 0.5 corresponds to a log-odds of $0$ and 0.95 to a log-odds of $\approx 3$.

In [50]:
log(0.5/(1-0.5))
0
In [51]:
log(0.95/(1-0.95))
2.94443897916644

This can be more or less reproduced with a Gaussian prior centered around 1.4 with a standard deviation of 0.8:

In [52]:
plot_normal( mean=1.4, sd=0.8 )

Resulting model:

$$Y_i|\beta_0,\beta_1,\beta_2,\beta_3,\beta_4 \sim \text{Bern}(\pi_i), \quad \log\left(\frac{\pi_i}{1-\pi_i}\right) = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i}$$

$$\beta_0 \sim N(1.4, 0.8)$$

$$\beta_1 \sim N(0,\cdot), \quad \beta_2 \sim N(0,\cdot), \quad \beta_3 \sim N(0,\cdot), \quad \beta_4 \sim N(0,\cdot)$$

where $\cdot$ stands for autoscale=TRUE (weakly informative prior).

d)¶

Since the outcome (likely/unlikely) is binary categorical, it is best modeled with a Bernoulli distribution. The parameter $\pi$ of that distribution must be a probability in the interval [0,1], which is best achieved with a sigmoid function, or equivalently by modeling the log-odds with a linear ansatz. Since $\beta_1,\ldots,\beta_4$ are again parameters of a linear model, we can assume Gaussian priors as usual.

Exercise 13.11¶

a)¶

Simulate the model:

In [53]:
robot_model_1 <- stan_glm( robots ~ transformers + books + age + income,
                             data = robot_data, family = binomial,
                             prior_intercept = normal(1.4, 0.8),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.4e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.464295 seconds (Warm-up)
Chain 1:                0.559222 seconds (Sampling)
Chain 1:                1.02352 seconds (Total)
Chain 1: 


Prior summary:

In [54]:
prior_summary( robot_model_1 )
Priors for model 'robot_model_1' 
------
Intercept (after predictors centered)
 ~ normal(location = 1.4, scale = 0.8)

Coefficients
  Specified prior:
    ~ normal(location = [0,0,0,...], scale = [1,1,1,...])
  Adjusted prior:
    ~ normal(location = [0,0,0,...], scale = [0.649,0.013,0.060,...])
------
See help('prior_summary.stanreg') for more details

Diagnostics:

In [55]:
mcmc_trace( robot_model_1 )
In [56]:
mcmc_dens_overlay( robot_model_1 )
In [57]:
mcmc_acf( robot_model_1 )

Looks good!

Posterior predictive check:

In [58]:
proportion_of_unbelievers <- function(x){mean(x == 1)}
pp_check(robot_model_1, nreps = 100,
         plotfun = "stat", stat = "proportion_of_unbelievers") + xlab("probability of not believing")
Warning message:
“'nreps' is ignored for this PPC”
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Appears to be in an ok range.

b)¶

In [59]:
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.80)
A tibble: 5 × 5
term             estimate    std.error      conf.low     conf.high
(Intercept)   1.139334821 0.3480184360  0.692670549  1.582254e+00
transformers -0.121588725 0.0510392103 -0.188425271 -5.465837e-02
books        -0.002080105 0.0009633804 -0.003376087 -8.505848e-04
age          -0.006461826 0.0050434602 -0.012965282 -5.122908e-06
income        0.009800610 0.0019556609  0.007365586  1.237479e-02
$$\log\left(\frac{\pi}{1-\pi}\right) = 1.14 - 0.12 X_1 - 0.002 X_2 - 0.006 X_3 + 0.010 X_4$$

$$\frac{\pi}{1-\pi} = \exp\left(1.14 - 0.12 X_1 - 0.002 X_2 - 0.006 X_3 + 0.010 X_4\right)$$

$$\pi = \frac{1}{1+e^{-z}}, \quad z = 1.14 - 0.12 X_1 - 0.002 X_2 - 0.006 X_3 + 0.010 X_4$$

c)¶

$\beta_3$:

In [60]:
c( exp(-0.013), exp(-0.006), exp(-5.1e-6) )
  1. 0.987084135020288
  2. 0.994017964053935
  3. 0.999994900013005

Age has a very small impact: with each additional year, the odds of answering 'unlikely' decrease by roughly 0-1%, i.e. older people are slightly more inclined to believe that robots will take over.

$\beta_4$:

In [61]:
c( exp(0.0074), exp(0.0098), exp(0.012) )
  1. 1.00742744766246
  2. 1.00984817725041
  3. 1.01207228886608

With every additional unit of income (thousands of dollars, judging by the data), the odds of answering 'unlikely' increase by roughly 0.7-1.2%, i.e. higher earners are slightly less inclined to believe that robots will take over.
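In both cases the percent change in odds per unit increase is (exp(beta) - 1) * 100; applied to the CI endpoints and medians copied from the tidy() output above:

```r
# Percent change in odds implied by a log-odds coefficient
odds_pct_change <- function(beta) (exp(beta) - 1) * 100

# age: per additional year (80% CI low, median, high)
round(odds_pct_change(c(-0.013, -0.006, -5.1e-6)), 2)  # -1.29 -0.60  0.00
# income: per additional unit of income
round(odds_pct_change(c(0.0074, 0.0098, 0.012)), 2)    #  0.74  0.98  1.21
```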

d)¶

95% credible interval:

In [62]:
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 5 × 5
term             estimate    std.error      conf.low     conf.high
(Intercept)   1.139334821 0.3480184360  0.453066782  1.8229782306
transformers -0.121588725 0.0510392103 -0.223538160 -0.0190189544
books        -0.002080105 0.0009633804 -0.004142485 -0.0001981823
age          -0.006461826 0.0050434602 -0.016463324  0.0033468971
income        0.009800610 0.0019556609  0.006090984  0.0137529405

At the 95% level, transformers, books and income appear to be significantly related to the belief that robots will take over the world within the next 10 years. Age does not appear to be significantly related to this belief.

At the 99% level, transformers and books also drop out, which is not unexpected:

In [63]:
tidy(robot_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.99)
A tibble: 5 × 5
term             estimate    std.error      conf.low    conf.high
(Intercept)   1.139334821 0.3480184360  0.240898975 2.0484447280
transformers -0.121588725 0.0510392103 -0.254841633 0.0140178974
books        -0.002080105 0.0009633804 -0.004902721 0.0003656827
age          -0.006461826 0.0050434602 -0.019859359 0.0064090354
income        0.009800610 0.0019556609  0.005002534 0.0149679852

Plausibly, the number of books and income correlate with education level, and age correlates with familiarity with technology. The number of Transformers movies watched is probably confounded with some other quantity; it is not obvious how it would directly affect belief that robots will take over the world.

In [64]:
ggplot( robot_data ) + geom_boxplot( aes(x=age, y=transformers, group=transformers))

It appears that the number of Transformers movies watched is also correlated with age, and hence indirectly with income.

Exercise 13.12¶

a)¶

In [65]:
robot_data %>% summarize_all( function(x) {mean(is.na(x))} )
A tibble: 1 × 5
robots  transformers  books  age    income
<dbl>   <dbl>         <dbl>  <dbl>  <dbl>
0       0             0      0      0

In-sample:

In [66]:
classification_summary(model = robot_model_1, data = robot_data, cutoff = 0.6)
$confusion_matrix
A tabyl: 2 × 3
y         0    1
<fct>     <dbl><dbl>
Likely    8    191
Unlikely  7    794
$accuracy_rates
A data.frame: 3 × 1
                 <dbl>
sensitivity      0.99126092
specificity      0.04020101
overall_accuracy 0.80200000

CV (use 5 folds; there appears to be a problem with this cutoff that leads to missing values for larger k):

In [67]:
cv_accuracy <- classification_summary_cv(model = robot_model_1, data = robot_data, cutoff = 0.6, k = 5)
cv_accuracy$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.9913289    0.03682964   0.801
In [68]:
cv_accuracy$folds
A data.frame: 5 × 4
fold  sensitivity  specificity  overall_accuracy
<int> <dbl>        <dbl>        <dbl>
1     0.9935897    0.02272727   0.780
2     0.9939759    0.05882353   0.835
3     0.9936709    0.00000000   0.785
4     0.9818182    0.05714286   0.820
5     0.9935897    0.04545455   0.785

Interestingly, the sensitivity is extremely high (respondents who do not believe that robots will take over the world are captured with high confidence), while the specificity is practically zero. With this cutoff, the model can hardly identify anyone who does believe that robots will take over the world.

b)¶

In-sample and cross-validated sensitivity, specificity, and overall accuracy are roughly similar. This makes sense: a model that is linear in its predictors has little capacity to overfit the data.

c)¶

The model predicts 'unlikely' for almost all data points. Given the class proportion of 0.8, this yields a baseline accuracy of 80%; correspondingly, the sensitivity is almost 100% and the specificity practically zero.
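The 80% baseline can be verified directly from the in-sample confusion matrix above (a quick base-R check; the counts are copied from the `classification_summary()` output):

```r
# In-sample confusion matrix counts from classification_summary():
#            predicted 0   predicted 1
# Likely              8           191
# Unlikely            7           794
likely   <- c(8, 191)
unlikely <- c(7, 794)
n <- sum(likely) + sum(unlikely)

# A trivial classifier that always predicts "Unlikely" (y = 1):
baseline_accuracy <- sum(unlikely) / n
baseline_accuracy  # 0.801 -- essentially what the model achieves
```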

Exercise 13.13¶

In [69]:
fake_news <- fake_news %>% select( type, title_has_excl, title_words, negative )
head( fake_news )
A data.frame: 6 × 4
  type  title_has_excl  title_words  negative
  <fct> <lgl>           <int>        <dbl>
1 fake  FALSE           17           8.47
2 real  FALSE           18           4.74
3 fake  TRUE            16           3.33
4 real  FALSE           11           6.09
5 fake  FALSE            9           2.66
6 real  FALSE           12           3.02

Exploratory Data Analysis¶

Exclamation mark¶

In [70]:
ggplot( data=fake_news ) + geom_mosaic( aes(x=product(type, title_has_excl), fill=type) )

There appears to be a clear association between exclamation marks in the title and fake news.

Number of words¶

In [71]:
ggplot( data=fake_news ) + geom_density( aes(x=title_words, fill=type), alpha=0.5 )

Fake news appear to have more words in the title.

Percent words with a negative sentiment¶

In [72]:
ggplot( data=fake_news ) + geom_density( aes(x=negative, fill=type), alpha=0.5 )

Fake news titles more often contain words with a negative sentiment.

Specify prior for intercept¶

We expect around 40% of news to be fake, but the proportion could reasonably lie anywhere between 10% and 90%.

In [73]:
log(0.4/(1-0.4))
-0.405465108108164
In [74]:
log(0.1/(1-0.1))
-2.19722457733622
In [75]:
log(0.9/(1-0.9))
2.19722457733622
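To see how well a Normal(-0.4, 0.8) prior on the intercept covers the plausible log-odds range computed above, one can compute its central 95% interval with base R's `qnorm()` and map it back to the probability scale (a quick check, not part of the original exercise):

```r
mu <- -0.4; sigma <- 0.8

# Central 95% interval of the prior on the log-odds scale
ci_logodds <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
ci_logodds          # roughly (-1.97, 1.17)

# Mapped back to probabilities with the inverse logit
plogis(ci_logodds)  # roughly (0.12, 0.76)
```

So this prior covers the lower end (10%) well but stays somewhat below 0.9 at the upper end; a slightly larger sd would widen it.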

The uncertainty in the prior log-odds is reflected by the following normal distribution:

In [76]:
plot_normal( mean=-0.4, sd=0.8 )

Simulate posterior¶

In [77]:
fake_news <- fake_news %>% 
    mutate( is_fake=as.numeric(type=="fake") ) %>% 
    select( -type )
In [78]:
fake_model_1 <- stan_glm( is_fake ~ title_has_excl + title_words + negative,
                             data = fake_news, family = binomial,
                             prior_intercept = normal(-0.4, 0.8),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 1.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.13 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.120301 seconds (Warm-up)
Chain 1:                0.139981 seconds (Sampling)
Chain 1:                0.260282 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.1e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.123034 seconds (Warm-up)
Chain 2:                0.146218 seconds (Sampling)
Chain 2:                0.269252 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1.1e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.12657 seconds (Warm-up)
Chain 3:                0.151761 seconds (Sampling)
Chain 3:                0.278331 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1.1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.131461 seconds (Warm-up)
Chain 4:                0.143539 seconds (Sampling)
Chain 4:                0.275 seconds (Total)
Chain 4: 
In [79]:
mcmc_trace( fake_model_1 )
In [80]:
mcmc_dens_overlay( fake_model_1 )
In [81]:
mcmc_acf( fake_model_1 )

Posterior predictive check¶

In [82]:
proportion_of_fake <- function(x){mean(x == 1)}
pp_check(fake_model_1, nreps = 100,
         plotfun = "stat", stat = "proportion_of_fake") + xlab("probability of news being fake")
Warning message:
“'nreps' is ignored for this PPC”
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The posterior distribution was narrowed down quite a bit!

Parameter ranges¶

In [83]:
tidy(fake_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 4 × 5
term                estimate    std.error   conf.low      conf.high
<chr>               <dbl>       <dbl>       <dbl>         <dbl>
(Intercept)         -2.9075219  0.75951530  -4.452530849  -1.4716072
title_has_exclTRUE   2.4434688  0.75895334   1.107301202   4.2078433
title_words          0.1113220  0.05745054  -0.002195581   0.2253234
negative             0.3178338  0.15404234   0.028119176   0.6258343

The number of words in the title appears not to be significant at the 95%-level.

Exclamation mark¶

In [84]:
c( exp(1.11), exp(2.44), exp(4.21) )
  1. 3.03435839443568
  2. 11.4730407427948
  3. 67.3565398101166

If an exclamation mark is present, the odds of the news being fake increase by a factor between roughly 3 and 67, with a posterior median around 11. This range is quite wide; more data would be needed to pin it down.
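Rather than only reading off interval endpoints, the posterior draws can be queried directly, e.g. for the posterior probability that the exclamation-mark odds ratio exceeds 3. This is a sketch assuming the fitted `fake_model_1` from above; `as.data.frame()` on an rstanarm `stanreg` object returns the posterior draws, one column per coefficient:

```r
# Extract posterior draws of the title_has_exclTRUE coefficient
draws <- as.data.frame(fake_model_1)

# Posterior probability that the odds ratio exp(beta) exceeds 3,
# i.e. that an exclamation mark at least triples the odds of fakeness
mean(exp(draws$title_has_exclTRUE) > 3)
```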

Number of words in title¶

In [85]:
c( exp(-0.00022), exp(0.11), exp(0.23) )
  1. 0.999780024198225
  2. 1.11627807045887
  3. 1.25860000992948

For every additional word, the odds for fake news increase by a factor of 1-1.26.

Percent words with negative sentiment¶

In [86]:
c( exp(0.028), exp(0.32), exp(0.63) )
  1. 1.02839568442143
  2. 1.37712776433596
  3. 1.87761057926434

Each additional percent of words with negative sentiment increases the odds of fake news by a factor of roughly 1.03-1.88 (i.e. 3-88%).

Overall the ranges are quite wide because of the limited amount of available data.

Performance metrics¶

In-sample:

In [87]:
classification_summary(model = fake_model_1, data = fake_news, cutoff = 0.5)
$confusion_matrix
A tabyl: 2 × 3
y      0   1
<dbl>  <dbl><dbl>
0      84   6
1      32  28
$accuracy_rates
A data.frame: 3 × 1
                 <dbl>
sensitivity      0.4666667
specificity      0.9333333
overall_accuracy 0.7466667

CV:

In [88]:
cv_accuracy <- classification_summary_cv(model = fake_model_1, data = fake_news, cutoff = 0.5, k = 10)
cv_accuracy$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.3400397    0.9386111    0.7

Sensitivity, specificity, and accuracy vary quite a bit across folds!

In [89]:
cv_accuracy$folds
A data.frame: 10 × 4
fold  sensitivity  specificity  overall_accuracy
<int> <dbl>        <dbl>        <dbl>
1     0.6666667    0.8888889    0.8000000
2     0.2000000    0.9000000    0.6666667
3     0.0000000    0.8888889    0.5333333
4     0.0000000    0.9166667    0.7333333
5     0.4444444    1.0000000    0.6666667
6     0.2857143    1.0000000    0.6666667
7     0.6666667    0.9166667    0.8666667
8     0.4285714    0.8750000    0.6666667
9     0.3750000    1.0000000    0.6666667
10    0.3333333    1.0000000    0.7333333

Using fewer folds (k = 10 leaves only 15 data points per hold-out set; k = 5 gives 30):

In [90]:
cv_accuracy2 <- classification_summary_cv(model = fake_model_1, data = fake_news, cutoff = 0.5, k = 5)
cv_accuracy2$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.467094     0.9345472    0.7466667
In [91]:
cv_accuracy2$folds
A data.frame: 5 × 4
fold  sensitivity  specificity  overall_accuracy
<int> <dbl>        <dbl>        <dbl>
1     0.5833333    0.9444444    0.8000000
2     0.4444444    0.9047619    0.7666667
3     0.5384615    0.9411765    0.7666667
4     0.3076923    0.9411765    0.6666667
5     0.4615385    0.9411765    0.7333333

Sensitivity, specificity, and accuracy now vary less across folds, but still noticeably.
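The cutoff of 0.5 is itself a tuning choice; one way to trade sensitivity against specificity is to scan a grid of cutoffs. A sketch, assuming the fitted `fake_model_1` and `fake_news` from above and the `classification_summary()` function from bayesrules:

```r
# In-sample sensitivity/specificity across a grid of cutoffs
cutoffs <- seq(0.3, 0.7, by = 0.1)
rates <- sapply(cutoffs, function(co) {
  classification_summary(model = fake_model_1, data = fake_news,
                         cutoff = co)$accuracy_rates[, 1]
})
colnames(rates) <- cutoffs
rownames(rates) <- c("sensitivity", "specificity", "overall_accuracy")
rates  # lower cutoffs raise sensitivity at the cost of specificity
```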

Exercise 13.14¶

Select some interesting predictors that might correlate with believing in ghosts¶

In [92]:
pulse_data <- pulse_of_the_nation %>% 
    select( ghosts, income, age, education, science_is_honest, trump_approval )
head( pulse_data )
A tibble: 6 × 6
ghosts  income  age  education       science_is_honest  trump_approval
<fct>   <dbl>   <dbl><fct>           <fct>              <fct>
Yes       8     64   College degree  Strongly Agree     Strongly disapprove
No       68     56   High school     Somewhat Agree     Strongly disapprove
No       46     63   Some college    Somewhat Agree     Somewhat Approve
No       51     48   High school     Somewhat Disagree  Strongly Approve
Yes     100     32   Some college    Strongly Agree     Somewhat Approve
No       54     64   Some college    Strongly Agree     Strongly disapprove

Exploratory Data Analysis¶

Income¶

In [93]:
ggplot( pulse_data ) + geom_density( aes(x=income, fill=ghosts), alpha=0.5 )

No obvious correlation; maybe people with medium incomes believe less in ghosts.

Age¶

In [94]:
ggplot( pulse_data ) + geom_density( aes(x=age, fill=ghosts), alpha=0.5 )

No obvious correlation; any difference by age appears very slight.

Education¶

In [95]:
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, education), fill=ghosts) )

There appears to be a slight correlation between believing in ghosts and education, however not as strong as suspected.

Science is honest¶

In [96]:
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, science_is_honest), fill=ghosts) )

Very small differences, probably not significant.

Trump approval¶

In [97]:
ggplot( pulse_data ) + geom_mosaic( aes(x=product(ghosts, trump_approval), fill=ghosts) )

Here, too, the differences are probably not significant.

Specify prior for intercept¶

We expect around 60% to believe in ghosts, but the proportion could reasonably lie between 20% and 90%.

In [98]:
log(0.6/(1-0.6))
0.405465108108164
In [99]:
log(0.2/(1-0.2))
-1.38629436111989
In [100]:
log(0.9/(1-0.9))
2.19722457733622

The uncertainty in the prior log-odds is approximately reflected by the following normal distribution:

In [101]:
plot_normal( mean=0.4, sd=0.6 )

Simulate posterior¶

In [102]:
pulse_data <- pulse_data %>% 
    mutate( ghosts=as.numeric(ghosts=="Yes") )
In [103]:
pulse_model_1 <- stan_glm( ghosts ~ age + income + education + science_is_honest + trump_approval,
                             data = pulse_data, family = binomial,
                             prior_intercept = normal(0.4, 0.6),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3.3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.33 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 2.80784 seconds (Warm-up)
Chain 1:                2.69246 seconds (Sampling)
Chain 1:                5.5003 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 3.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.32 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 2.85831 seconds (Warm-up)
Chain 2:                2.94854 seconds (Sampling)
Chain 2:                5.80686 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2.2e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 2.87614 seconds (Warm-up)
Chain 3:                3.39404 seconds (Sampling)
Chain 3:                6.27018 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 2.8e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.28 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 3.63366 seconds (Warm-up)
Chain 4:                4.02295 seconds (Sampling)
Chain 4:                7.6566 seconds (Total)
Chain 4: 
In [104]:
mcmc_trace( pulse_model_1 )
In [105]:
mcmc_dens_overlay( pulse_model_1 )
In [106]:
mcmc_acf( pulse_model_1 )

Posterior predictive check¶

In [107]:
proportion_of_ghosts <- function(x){mean(x == 1)}
pp_check(pulse_model_1, nreps = 100,
         plotfun = "stat", stat = "proportion_of_ghosts") + xlab("probability of believing in ghosts")
Warning message:
“'nreps' is ignored for this PPC”
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Parameter ranges¶

In [108]:
tidy(pulse_model_1, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 15 × 5
term                                 estimate       std.error    conf.low      conf.high
<chr>                                <dbl>          <dbl>        <dbl>         <dbl>
(Intercept)                          -1.7272156368  0.649998894  -3.057031004  -0.5112633730
age                                  -0.0087975842  0.004071164  -0.016826448  -0.0008942501
income                               -0.0007426599  0.001290706  -0.003374491   0.0017734899
educationGraduate degree             -0.3036828995  0.198303966  -0.688425521   0.0848380810
educationHigh school                  0.3753208081  0.196722888  -0.001248007   0.7708382558
educationOther                        0.5684712488  0.500754317  -0.445618556   1.5956771172
educationSome college                 0.2669902207  0.174887687  -0.072274375   0.6153956820
science_is_honestSomewhat Agree       1.1356768450  0.544904000   0.107309155   2.2914345614
science_is_honestSomewhat Disagree    0.8360796514  0.568171033  -0.232795020   2.0257495431
science_is_honestStrongly Agree       0.9541872485  0.552602988  -0.068699415   2.1235345838
science_is_honestStrongly Disagree    1.2913664396  0.573804169   0.217174697   2.4911675867
trump_approvalSomewhat Approve        0.5323032328  0.346767339  -0.122920153   1.2204693874
trump_approvalSomewhat disapprove     0.4953353137  0.368243727  -0.207054762   1.2346362902
trump_approvalStrongly Approve        0.5526476714  0.327415250  -0.068107641   1.2005745420
trump_approvalStrongly disapprove     0.7249495652  0.311730683   0.128547244   1.3608364941

Significant at the 95% level: age; trump_approval: Strongly disapprove; science_is_honest: Somewhat Agree and Strongly Disagree.

Not significant: income, education, the remaining trump_approval levels, and the remaining science_is_honest levels.

It remains to be investigated how to handle categorical predictors for which only some levels are significant.

Age¶

In [109]:
c( exp(-0.017), exp(-0.00088), exp(-0.00069) )
  1. 0.98314368463491
  2. 0.999120387086446
  3. 0.999310237995258

Even though the relation is significant, the effect appears to be small: the odds of believing in ghosts decrease by roughly 0-2% per additional year of age.

trump approval: strongly disapprove¶

In [110]:
c( exp(0.13), exp(0.71), exp(1.34) )
  1. 1.13882838332462
  2. 2.03399125864675
  3. 3.81904350536634

If somebody strongly disapproves of Trump, the odds that they believe in ghosts are roughly doubled (at the posterior median)! This is indeed an interesting finding.

science is honest: somewhat agree¶

In [111]:
c( exp(0.11), exp(1.14), exp(2.29) )
  1. 1.11627807045887
  2. 3.12676836518616
  3. 9.87493768117318

science is honest: strongly disagree¶

In [112]:
c( exp(0.22), exp(1.29), exp(2.49) )
  1. 1.24607673058738
  2. 3.63278655575281
  3. 12.0612761204447

Strong factors! With such wide intervals, though, it is hard to say how real these effects are.

Simpler model?¶

Only use significant predictors:

In [113]:
pulse_data2 <- pulse_data %>% 
    mutate( trump_disapprove=ifelse(trump_approval=="Strongly disapprove", 1, 0)) %>% 
    mutate( science_is_honest_somewhat_agree=ifelse(science_is_honest=="Somewhat Agree", 1, 0) ) %>% 
    mutate( science_is_honest_strongly_disagree=ifelse(science_is_honest=="Strongly Disagree", 1, 0) ) %>% 
    select( ghosts, age, trump_disapprove, science_is_honest_somewhat_agree, science_is_honest_strongly_disagree )
head( pulse_data2 )
A tibble: 6 × 5
ghosts  age  trump_disapprove  science_is_honest_somewhat_agree  science_is_honest_strongly_disagree
<dbl>   <dbl><dbl>             <dbl>                             <dbl>
1       64   1                 0                                 0
0       56   1                 1                                 0
0       63   0                 1                                 0
0       48   0                 0                                 0
1       32   0                 0                                 0
0       64   1                 0                                 0
In [114]:
pulse_model_2 <- stan_glm( ghosts ~ age + trump_disapprove + science_is_honest_somewhat_agree + science_is_honest_strongly_disagree,
                             data = pulse_data2, family = binomial,
                             prior_intercept = normal(0.4, 0.6),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 2.5e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.25 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.533343 seconds (Warm-up)
Chain 1:                0.5704 seconds (Sampling)
Chain 1:                1.10374 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 2.2e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.22 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.526281 seconds (Warm-up)
Chain 2:                0.585505 seconds (Sampling)
Chain 2:                1.11179 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.2 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.527449 seconds (Warm-up)
Chain 3:                0.595125 seconds (Sampling)
Chain 3:                1.12257 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 2.1e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.21 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.525411 seconds (Warm-up)
Chain 4:                0.588588 seconds (Sampling)
Chain 4:                1.114 seconds (Total)
Chain 4: 
In [115]:
tidy(pulse_model_2, effects = "fixed", conf.int = TRUE, conf.level = 0.95)
A tibble: 5 × 5
term                                 estimate      std.error    conf.low     conf.high
<chr>                                <dbl>         <dbl>        <dbl>        <dbl>
(Intercept)                          -0.327064681  0.232890583  -0.78064462   0.129366642
age                                  -0.008453545  0.003948532  -0.01638031  -0.000616473
trump_disapprove                      0.225655382  0.137093533  -0.03961780   0.488840621
science_is_honest_somewhat_agree      0.276340227  0.145920437  -0.01127357   0.557672279
science_is_honest_strongly_disagree   0.494714718  0.223413631   0.06069460   0.930210683

Now only age and science_is_honest_strongly_disagree are significant.

Third and simplest model:

In [116]:
pulse_model_3 <- stan_glm( ghosts ~ age + science_is_honest_strongly_disagree,
                             data = pulse_data2, family = binomial,
                             prior_intercept = normal(0.4, 0.6),
                             prior = normal(0, 1, autoscale=TRUE),
                             chains = 4, iter = 5000*2, seed = 84735)
SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 3e-05 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.3 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.429696 seconds (Warm-up)
Chain 1:                0.501295 seconds (Sampling)
Chain 1:                0.930991 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 2.4e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.439278 seconds (Warm-up)
Chain 2:                0.52289 seconds (Sampling)
Chain 2:                0.962168 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 2.4e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.24 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.430168 seconds (Warm-up)
Chain 3:                0.52817 seconds (Sampling)
Chain 3:                0.958338 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'bernoulli' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1.9e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.423899 seconds (Warm-up)
Chain 4:                0.497535 seconds (Sampling)
Chain 4:                0.921434 seconds (Total)
Chain 4: 

Performance metrics¶

CV (use 5 folds for efficiency):

In [117]:
cv_accuracy1 <- classification_summary_cv(model = pulse_model_1, data = pulse_data, cutoff = 0.5, k = 5)
cv_accuracy1$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.09386942   0.9055878    0.597
In [118]:
cv_accuracy2 <- classification_summary_cv(model = pulse_model_2, data = pulse_data2, cutoff = 0.5, k = 5)
cv_accuracy2$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.0408025    0.9750452    0.62
In [119]:
cv_accuracy3 <- classification_summary_cv(model = pulse_model_3, data = pulse_data2, cutoff = 0.5, k = 5)
cv_accuracy3$cv
A data.frame: 1 × 3
sensitivity  specificity  overall_accuracy
<dbl>        <dbl>        <dbl>
0.01606765   0.9857816    0.618

Comparing the three models is not straightforward. Models 2 and 3 behave very similarly, while model 1 achieves a somewhat higher sensitivity at the cost of lower specificity, with similar overall accuracy.
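Classification accuracy at a fixed cutoff is a coarse yardstick. A complementary comparison uses expected log predictive density via PSIS-LOO; a sketch assuming the three fitted models above (all fit to the same observations), using rstanarm's `loo()` and the loo package's `loo_compare()`:

```r
# Compare out-of-sample predictive performance via PSIS-LOO
loo_1 <- loo(pulse_model_1)
loo_2 <- loo(pulse_model_2)
loo_3 <- loo(pulse_model_3)

# Higher elpd is better; differences within ~2 standard errors
# of elpd_diff are inconclusive
loo_compare(loo_1, loo_2, loo_3)
```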